Skip to content

Conversation

@janjust
Copy link
Contributor

@janjust janjust commented Oct 24, 2025

This commit completely removes the mca/coll/hcoll component from the Open MPI source code. The hcoll component provided integration with Mellanox's Hierarchical Collectives library for collective operation offload.

Changes include:

  • Removed ompi/mca/coll/hcoll component directory and all source files
  • Removed config/ompi_check_libhcoll.m4 configuration macro
  • Updated coll_tags.h to remove HCOLL tag space definitions
  • Updated coll_base_allgather.c and coll_base_allgatherv.c to use NEIGHBOR tag space instead of HCOLL tag space
  • Removed hcoll references from platform configuration files
  • Removed hcoll documentation and configuration options
  • Removed hcoll references from code comments

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also remove hcoll from all the docs.

@janjust
Copy link
Contributor Author

janjust commented Oct 27, 2025

Please also remove hcoll from all the docs.

This PRs removes any reference to HCOLL in main, at least grepping for hcoll/hcol or HCOLL/HCOL I don't see anything, so we should be good.

jsquyres
jsquyres previously approved these changes Oct 27, 2025
Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PRs removes any reference to HCOLL in main, at least grepping for hcoll/hcol or HCOLL/HCOL I don't see anything, so we should be good.

I'm a dope -- I did my grep wrong. This PR looks good.


/* Since each process sends several non-contiguos blocks of data, each block sent (and therefore each send and recv call) needs a different tag. */
/* As base OpenMPI only provides one tag for allgather, we are forced to use a tag space from other components in the send and recv calls */
MCA_PML_CALL(isend(tmpsend + (ptrdiff_t) send_disp * scount * rext, scount, rdtype, sendto, MCA_COLL_BASE_TAG_HCOLL_BASE - send_disp, MCA_PML_BASE_SEND_STANDARD, comm, requests + transfer_count));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is problematic:

  1. for jobs with more than 1024 ranks we will use tags into an unreserved space (outside the [ MCA_COLL_BASE_TAG_NEIGHBOR_BASE .. MCA_COLL_BASE_TAG_NEIGHBOR_END ] which is only 1024 long).
  2. it changes all the collective components internal messaging. Which means we need to bump the component version up to break compatibility with older components
  3. I think 2 means we cannot do it in the middle of a stable series, soas is this PR is only meant for 6.0. If we want to bring it into the 5.x we should leave the HCOLL tag space as is, and only clean it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't going into v5.x only into into v6

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, I just wanted to make sure we are well aware of this. However, I think the change to the collective component version is still required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I'll adjust

This commit completely removes the mca/coll/hcoll component from the
Open MPI source code. The hcoll component provided integration with
Mellanox's Hierarchical Collectives library for collective operation
offload.

Changes include:
- Removed ompi/mca/coll/hcoll component directory and all source files
- Removed config/ompi_check_libhcoll.m4 configuration macro
- Updated coll_tags.h to remove HCOLL tag space definitions
- Updated coll_base_allgather.c and coll_base_allgatherv.c to use
  NEIGHBOR tag space instead of HCOLL tag space
- Removed hcoll references from platform configuration files
- Removed hcoll documentation and configuration options
- Removed hcoll references from code comments

Signed-off-by: Tomislav Janjusic <[email protected]>
@janjust
Copy link
Contributor Author

janjust commented Oct 28, 2025

@jsquyres @bosilca please re-review, second commit updates component versions

bosilca
bosilca previously approved these changes Oct 28, 2025
(1 == component->mca_type_minor_version &&
0 == component->mca_type_release_version))) {

return query_3_0_0((const mca_coll_base_component_3_0_0_t *)component, comm, priority, module);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you have bumped to 3.1.0, please rename this function to query_3_1_0 -- the convention is to name everything according to the highest required version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course

@janjust
Copy link
Contributor Author

janjust commented Oct 28, 2025

@bosilca @jsquyres , I'm echoing my slack message here too but it just occured to me.
Why do we need to update the component version? It was already bumped in main in preparation to v6.0, since we branched, shouldn't that just remain v3.0.0? it should stay that way until we release, no?

@bosilca
Copy link
Member

bosilca commented Oct 28, 2025

If it was already bumped for 6.x then there is no need to bump it again. I missed that bump.

@jsquyres
Copy link
Member

Did you actually change the coll interface? I didn't review the entire PR -- I just saw that @bosilca told you to change the interface number to 3.1.0, and assumed that that was because you changed the coll interface.

If this PR isn't changing the coll interface, then there's no reason to update the coll interface version number to 3.1.0.

But if you are changing the coll interface, then yes, you do need to change the version number (e.g., to 3.1.0). The coll interface version is a different version number than the main OMPI version number (i.e., it's a different value than / not related to / not comparable to v5.0.x or v6.0.x).

@bosilca
Copy link
Member

bosilca commented Oct 28, 2025

He didn't change the API, he changed the space for the pt2pt tags used by collective components, which is as problematic as changing the API. I think what @janjust is saying is that there is release with the coll components are version 3.0.0 yet, because there is no 6.x release, so we might not need to bump the coll component version to 3.1.

@jsquyres
Copy link
Member

he change the space for the pt2pt tags used by collective components, which is as problematic as changing the API

Agreed - this is changing the interface.

Did we change the coll API version to 3.0.0 just because the main OMPI version went to 6.0.0? Or some other reason?

Regardless, now I grok: the v5.0.x series had coll API version < 3.0.0, so v6.0.x will be the first release with coll aPI version 3.0.0. So: no need to bump it again to 3.1.0.

@janjust
Copy link
Contributor Author

janjust commented Oct 28, 2025

that's right not changing the interface but i'm reverting the pt2pt tags used by different collective components.
And correct, we already bumped the interface to 3.0.0 in main, so no need to bump it again

@janjust
Copy link
Contributor Author

janjust commented Oct 28, 2025

ok great, I just removed the version bump, so can I get a thumbs up and let's merge this

@bosilca bosilca merged commit 54c4dbd into open-mpi:main Oct 28, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants